Preface

Open RStudio to work through the practicals. Tasks marked with * are optional.

R packages

This practical uses several R packages. The packages, with the versions used to generate the solutions, are:

  • survival (version: 3.3.1)
  • memisc (version: 0.99.30.7)
  • ggplot2 (version: 3.3.6)

R version 4.2.1 (2022-06-23 ucrt)

Dataset

For this practical, we will use the heart and retinopathy data sets from the survival package. More details about the data sets can be found in:

https://stat.ethz.ch/R-manual/R-devel/library/survival/html/heart.html

https://stat.ethz.ch/R-manual/R-devel/library/survival/html/retinopathy.html

Data Transformation, Exploration and Visualization

Before starting any statistical analysis, it is important to transform and explore your data set.

Data Transformation

Task 1

  • As mentioned in the manual of the heart data set, the variable age is stored as age - 48 (age in years minus 48). Let’s bring age back to the original scale. Do not overwrite the variable age, but create a new variable with the name age_orig.
  • Convert the variable surgery into a factor with levels 0: no and 1: yes.

Use the function factor(…) to convert a numeric variable to a factor.

Solution 1

heart$age_orig <- heart$age + 48
heart$surgery <- factor(heart$surgery, levels = c(0, 1), labels = c("no", "yes"))
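To illustrate how the levels and labels arguments of factor(…) interact, here is a minimal sketch on a made-up 0/1 vector (the vector x is hypothetical and only stands in for a column such as surgery):

```r
# Hypothetical 0/1 indicator, standing in for a column such as heart$surgery
x <- c(0, 1, 1, 0, 1)

# levels: the values that occur in the data
# labels: the names that replace those values, in the same order
f <- factor(x, levels = c(0, 1), labels = c("no", "yes"))

levels(f)      # "no" "yes"
as.integer(f)  # 1 2 2 1 2 (internal codes, not the original 0/1 values)
table(f)       # counts per label
```

Note that the internal integer codes start at 1, so converting a factor back with as.integer(…) does not return the original 0/1 coding.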

Task 2

Categorize the variable age from the retinopathy data set as young: [minimum age until mean age) and old: [mean age until maximum age]. Give this variable the name ageCat. Print the first 6 rows of the data set retinopathy.

To dichotomize a numeric variable combine the function as.numeric(…) with a logical condition (e.g., as.numeric(X > 2)). This logical condition will split the numeric variable into two parts (young and old). Use the function factor(…) to convert a variable into a factor.

Solution 2

retinopathy$ageCat <- as.numeric(retinopathy$age >= mean(retinopathy$age))
retinopathy$ageCat <- factor(retinopathy$ageCat, levels = c(0, 1), labels = c("young", "old"))
head(retinopathy)
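The effect of the >= comparison at the boundary can be checked on a small example. The sketch below uses a hypothetical age vector whose mean is exactly one of its values:

```r
# Hypothetical ages; their mean is exactly 30
age <- c(10, 20, 30, 40, 50)

# TRUE/FALSE from the comparison becomes 1/0
old <- as.numeric(age >= mean(age))
old  # 0 0 1 1 1

# Label the 0/1 codes
ageCat <- factor(old, levels = c(0, 1), labels = c("young", "old"))
ageCat
```

Because the comparison is >=, a value equal to the mean falls into the old group, which matches the half-open interval used for young in the task.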

Task 3

Categorize futime from the retinopathy data set as follows:

  • short: [minimum futime until 25).
  • medium: [25 until 45).
  • long: [45 until maximum futime].

Give this variable the name futimeCut. Print the first 6 rows of the data.

Create a variable that is identical to the futime variable (use the name futimeCut). Then use indexing (e.g., X[X < 25]) to select the correct subset of the new variable futimeCut and set it to the new category (e.g., “short”).

For example, you can create the short category as:
retinopathy$futimeCut <- retinopathy$futime
retinopathy$futimeCut[retinopathy$futime < 25] <- "short"
Now continue with the other categories.

Solution 3

retinopathy$futimeCut <- retinopathy$futime
retinopathy$futimeCut[retinopathy$futime < 25] <- "short"
retinopathy$futimeCut[retinopathy$futime >= 25 & retinopathy$futime < 45] <- "medium"
retinopathy$futimeCut[retinopathy$futime >= 45] <- "long"
head(retinopathy)
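As an alternative to indexing, base R's cut(…) can create all three categories in one call. This is a sketch on a hypothetical futime vector, not the approach used in the solution above:

```r
# Hypothetical follow-up times, including the boundary values 25 and 45
futime <- c(5, 24.9, 25, 44.9, 45, 70)

# right = FALSE makes the intervals closed on the left: [a, b)
futimeCut <- cut(futime,
                 breaks = c(-Inf, 25, 45, Inf),
                 right  = FALSE,
                 labels = c("short", "medium", "long"))

as.character(futimeCut)
# "short" "short" "medium" "medium" "long" "long"
```

One difference worth noting: cut(…) returns a factor with the levels in the given order, whereas assigning strings through indexing produces a character vector.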

Task 4

Create 2 vectors of size 50 and transform them as follows:

  • Sex: takes 2 values 0 and 1.
  • Age: takes values from 20 till 80.
  • Convert the Sex variable into a factor with levels 0: female and 1: male.
  • Define the new variable AgeCat as dichotomous with Age <= 50 to be 0 and 1 otherwise.
  • Convert the AgeCat variable into a factor with levels 0: young and 1: old.
  • Overwrite the Age variable by \(\frac{Age-mean(Age)}{sd(Age)}\).

To sample a numeric or categorical variable use the function sample(…). To convert a numeric variable to a categorical one use the function factor(…). To dichotomize a numeric variable use the function as.numeric(…).

Solution 4

Sex <- sample(0:1, 50, replace = TRUE)
Age <- sample(20:80, 50, replace = TRUE)
Sex <- factor(Sex, levels = 0:1, labels = c("female", "male"))
AgeCat <- as.numeric(Age > 50)
AgeCat <- factor(AgeCat, levels = 0:1, labels = c("young", "old"))
Age <- (Age - mean(Age))/sd(Age)
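Since sample(…) draws at random, results differ between runs. The sketch below fixes the random seed (the value 123 is arbitrary) and checks the manual standardization against base R's scale(…); the names z_manual and z_scale are introduced here for illustration only:

```r
# Fix the seed so the same sample is drawn on every run (123 is arbitrary)
set.seed(123)
Age <- sample(20:80, 50, replace = TRUE)

# Manual standardization, as in the solution
z_manual <- (Age - mean(Age)) / sd(Age)

# scale() centers and divides by the standard deviation by default;
# it returns a one-column matrix, so convert back to a plain vector
z_scale <- as.numeric(scale(Age))

all.equal(z_manual, z_scale)  # TRUE
mean(z_manual)                # numerically zero
sd(z_manual)                  # 1
```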

Task 5

Create a data frame with the name DF as follows:

  • Include the following vectors: Sex, Age, AgeCat from the previous task.
  • Use the names: Gender, StandardizedAge, DichotomousAge.

Solution 5

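# Without explicit names, the columns would be called Sex, Age and AgeCat:
DF <- data.frame(Sex, Age, AgeCat)
# Supplying names gives the requested column names (this overwrites DF):
DF <- data.frame(Gender = Sex, StandardizedAge = Age, DichotomousAge = AgeCat)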
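Column names can also be set after the data frame has been created. The following sketch uses three short hypothetical vectors and compares both routes; DF2 is a name introduced here for illustration:

```r
# Short hypothetical vectors standing in for Sex, Age and AgeCat
Sex    <- factor(c("female", "male", "male"))
Age    <- c(-1, 0, 1)
AgeCat <- factor(c("young", "young", "old"), levels = c("young", "old"))

# Route 1: name the columns while creating the data frame
DF <- data.frame(Gender = Sex, StandardizedAge = Age, DichotomousAge = AgeCat)

# Route 2: create first, rename afterwards with names()
DF2 <- data.frame(Sex, Age, AgeCat)
names(DF2) <- c("Gender", "StandardizedAge", "DichotomousAge")

names(DF)
all.equal(DF, DF2)  # TRUE
str(DF)             # compact overview of column types
```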

Task 6

Create 2 vectors of size 150 as follows:

  • Treatment: takes 2 values 1 and 2.
  • Weight: takes values from 50 till 100.
  • Convert the Treatment variable into a factor with levels 1: no and 2: yes.
  • Overwrite the Weight variable by Weight * 1000.
  • Create a data frame including Treatment and Weight.

To sample a numeric or categorical variable use the function sample(…). To convert a numeric variable to a categorical one use the function factor(…).

Solution 6

Treatment <- sample(1:2, 150, replace = TRUE)
Weight <- sample(50:100, 150, replace = TRUE)
Treatment <- factor(Treatment, levels = 1:2, labels = c("no", "yes"))
Weight <- Weight * 1000
data.frame(Treatment, Weight)

Task 7

Create a list called my_list with the following:

  • let: a to i.
  • sex: factor taking the values males and females and length 50.
  • mat: matrix
    1    2
    3    4

To obtain letters use the built-in vector letters (e.g., letters[1:3]). To sample a numeric or categorical variable use the function sample(…). To convert a numeric variable to a categorical one use the function factor(…).

Solution 7

let <- letters[1:9]
sex <- sample(1:2, 50, replace = TRUE)
sex <- factor(sex, levels = 1:2, labels = c("males", "females"))
mat <- matrix(1:4, 2, 2, byrow = TRUE)
my_list <- list(let = let, sex = sex, mat = mat)
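Once the list exists, its elements can be extracted in several ways. A short sketch, using a smaller hypothetical sex factor so that the example is deterministic:

```r
# Rebuild a small version of the list (sex is shortened and fixed here)
my_list <- list(let = letters[1:9],
                sex = factor(c("males", "females")),
                mat = matrix(1:4, 2, 2, byrow = TRUE))

my_list$let[1:3]        # "a" "b" "c"
my_list[["mat"]][2, 1]  # 3 (second row, first column)
class(my_list["mat"])   # "list": single brackets keep the list wrapper
names(my_list)          # "let" "sex" "mat"
```

Double brackets (or $) return the element itself, while single brackets return a sub-list.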

Data Exploration

Let’s obtain some descriptive statistics.

Task 1

Obtain the mean and standard deviation for the variable age using the heart data set.

Use the functions mean(…) and sd(…).

Solution 1

mean(heart$age)
## [1] -2.484027
sd(heart$age)
## [1] 9.419999

Task 2

Using the retinopathy data set:

  • Obtain the median and interquartile range for age.
  • Obtain the percentage per type.
  • Check whether there are missing values in the variable age.

Use the functions median(…) and IQR(…) to obtain the median and the interquartile range. Load the package memisc and use the function percent(…) in order to obtain the percentages. To check whether there are missing values use the functions sum(is.na(…)).

Solution 2

median(retinopathy$age)
## [1] 16
IQR(retinopathy$age)
## [1] 20
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
percent(retinopathy$type)
##  juvenile     adult         N 
##  57.86802  42.13198 394.00000
sum(is.na(retinopathy$age)) # alternative: any(is.na(retinopathy$age))
## [1] 0
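If memisc is not available, percentages can also be obtained with base R. The sketch below uses hypothetical vectors (type, age_with_na) and additionally shows what happens when missing values are present:

```r
# Hypothetical categorical variable
type <- factor(c("juvenile", "juvenile", "adult", "juvenile"),
               levels = c("juvenile", "adult"))

# table() counts, prop.table() converts counts to proportions
pct <- 100 * prop.table(table(type))
pct  # juvenile 75, adult 25

# Hypothetical numeric variable with one missing value
age_with_na <- c(10, NA, 30)
sum(is.na(age_with_na))            # 1: number of missing values
any(is.na(age_with_na))            # TRUE: at least one is missing
median(age_with_na)                # NA, because of the missing value
median(age_with_na, na.rm = TRUE)  # 20
```

Unlike the memisc output shown above, this base-R version does not append the total N.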

Task 3*

Using the data frame DF created in Task 5 of the Data Transformation section:

  • Calculate the mean of the variable StandardizedAge.
  • Calculate the standard deviation of the variable StandardizedAge.
  • Calculate the frequencies of the variable Gender.
  • Calculate the frequencies of the variable DichotomousAge.
  • Calculate the frequencies of both variables Gender and DichotomousAge (crosstab table).
  • What are the dimensions of the data.frame?

To calculate the frequencies, use the functions length(…) or table(…). To obtain the dimensions use the function dim(…).

Solution 3*

mean(DF$StandardizedAge)
## [1] 1.891608e-16
sd(DF$StandardizedAge)
## [1] 1
length(DF$Gender[DF$Gender == "female"])
## [1] 22
length(DF$Gender[DF$Gender == "male"])
## [1] 28
table(DF$Gender)
## 
## female   male 
##     22     28
table(DF$Gender, DF$DichotomousAge)
##         
##          young old
##   female    15   7
##   male      16  12
dim(DF)
## [1] 50  3
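A crosstab can be extended with totals and row proportions. This is a sketch on two short hypothetical factors; addmargins(…) and prop.table(…) are base R functions:

```r
# Hypothetical factors standing in for Gender and DichotomousAge
Gender <- factor(c("female", "female", "male", "male", "male"))
DichotomousAge <- factor(c("young", "old", "young", "young", "old"),
                         levels = c("young", "old"))

tab <- table(Gender, DichotomousAge)
tab

# Add row and column totals
addmargins(tab)

# Proportions within each row (margin = 1); each row sums to 1
row_prop <- prop.table(tab, margin = 1)
row_prop
```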

Task 4

Obtain the Pearson and Spearman correlations of the variables year and age of the heart data set.

To calculate the correlations, use the function cor(…) and check the argument method.

Solution 4

cor(heart$year, heart$age, method = "pearson")
## [1] -0.1623965
cor(heart$year, heart$age, method = "spearman")
## [1] -0.1770664
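The difference between the two methods is easiest to see on a small example. The Spearman correlation is the Pearson correlation computed on the ranks, as the following sketch with hypothetical vectors x and y shows:

```r
# Hypothetical data: y increases with x, but not linearly
x <- c(1, 2, 3, 4, 5)
y <- c(1, 4, 9, 16, 100)

r_pearson  <- cor(x, y, method = "pearson")   # below 1: relation is not linear
r_spearman <- cor(x, y, method = "spearman")  # 1: relation is monotone

# Spearman equals Pearson applied to the ranks
r_ranks <- cor(rank(x), rank(y))
all.equal(r_spearman, r_ranks)  # TRUE
```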

Data Visualization

Let’s visualize the data.

Task 1

Using the heart data set:

  • Create a scatterplot with the variables age and year.
  • Change the labels of the axes. In particular, give the name Age for the x-axis and Year of acceptance for the y-axis.
  • Give a different color to the patients that had a transplant.
  • Add a legend.

Use the function plot(…, xlab, ylab, col). Use the function legend(…) to add a legend to the plot.

Solution 1

plot(heart$age, heart$year)

plot(heart$age, heart$year, xlab = "Age", ylab = "Year of acceptance")

plot(heart$age, heart$year, xlab = "Age", ylab = "Year of acceptance", col = heart$transplant)
legend(-40, 6, c("no", "yes"), col = c("black", "red"), pch = 1)
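To guarantee that the colours in the legend match the colours of the points, one option is to define the colour vector once and reuse it. The sketch below uses small hypothetical vectors (age, year, transplant) and places the legend by keyword instead of by coordinates:

```r
# Hypothetical data standing in for heart$age, heart$year, heart$transplant
age        <- c(-10, -5, 0, 5, 10)
year       <- c(1, 2, 3, 4, 5)
transplant <- factor(c(0, 1, 0, 1, 1), levels = c(0, 1))

# One colour per factor level, defined once
cols <- c("black", "red")

# as.integer() on a factor gives the level codes 1 and 2,
# which are used here to pick the colour for each point
point_col <- cols[as.integer(transplant)]

plot(age, year, xlab = "Age", ylab = "Year of acceptance", col = point_col)
legend("topleft", legend = c("no", "yes"), col = cols, pch = 1,
       title = "Transplant")
```

Using a keyword such as "topleft" avoids having to pick coordinates that depend on the range of the data.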

Task 2

Using the retinopathy data set:

  • Create a boxplot of age per status.
  • Change the colour to blue and green respectively.

Use the function boxplot(…).

Solution 2

boxplot(retinopathy$age ~ retinopathy$status)

boxplot(retinopathy$age ~ retinopathy$status, col = c("blue", "green"))
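The formula interface also accepts a data argument, which avoids repeating the data set name, and boxplot(…) invisibly returns the statistics it draws. A sketch on a hypothetical data frame d:

```r
# Hypothetical data: five ages in each of two status groups
d <- data.frame(age    = c(10, 20, 30, 40, 50, 15, 25, 35, 45, 55),
                status = rep(c(0, 1), each = 5))

# data = d lets the formula refer to the columns by name
b <- boxplot(age ~ status, data = d, col = c("blue", "green"),
             xlab = "Status", ylab = "Age")

b$names       # group labels: "0" "1"
b$stats[3, ]  # third row holds the medians: 30 35
```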

Task 3*

Using the retinopathy data set:

  • Create a smooth plot of age with risk.
  • Create a density plot of age per type group.

Use the ggplot2 package and the functions: geom_smooth(…) and geom_density(…).

Solution 3*

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:memisc':
## 
##     syms
ggplot(retinopathy, aes(age, risk)) +
  geom_smooth(colour = 'black', span = 0.4)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(retinopathy, aes(age, fill = type)) +
  geom_density(alpha = 0.25)
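For comparison, a per-group density plot can also be drawn with base R only. This is a rough sketch on simulated data (the data frame d and its values are made up), not a reproduction of the ggplot2 figure:

```r
# Simulated data: two groups with different age distributions
set.seed(1)  # arbitrary seed, for reproducibility
d <- data.frame(age  = c(rnorm(30, mean = 15, sd = 5),
                         rnorm(30, mean = 40, sd = 8)),
                type = rep(c("juvenile", "adult"), each = 30))

# Estimate one kernel density per group
dens <- lapply(split(d$age, d$type), density)

# Draw the first curve with axis limits wide enough for both, then add the second
y_max <- max(sapply(dens, function(z) max(z$y)))
plot(dens$juvenile, main = "", xlab = "age",
     xlim = range(d$age), ylim = c(0, y_max))
lines(dens$adult, lty = 2)
legend("topright", legend = c("juvenile", "adult"), lty = c(1, 2))
```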


© Eleni-Rosalina Andrinopoulou